
Optimize traverse #4498

Merged: 31 commits merged into typelevel:main from optimize-traverse on May 1, 2024

Conversation

@TimWSpence (Member) commented Aug 23, 2023

This addresses the StackSafeMonad part of #4408. I thought that there were enough changes in this to merit a separate PR for the Defer part.

Open questions (will edit these out of the description later):

  • The TraverseFilter laws hard-code Option. How do we test that the new branching implementation is lawful?

The following benchmarks were run on an AWS t3.xlarge:

Baseline (b08196e)

Benchmark (length) Mode Cnt Score Error Units
TraverseBench.filterList 10000 thrpt 20 3247.486 ± 270.817 ops/s
TraverseBench.filterVector 10000 thrpt 20 3351.585 ± 255.883 ops/s
TraverseBench.mapList 10000 thrpt 20 3198.905 ± 235.865 ops/s
TraverseBench.mapVector 10000 thrpt 20 3906.339 ± 225.508 ops/s
TraverseBench.traverseChain 10000 thrpt 20 556.706 ± 31.814 ops/s
TraverseBench.traverseChainError 10000 thrpt 20 1278.954 ± 74.242 ops/s
TraverseBench.traverseFilterChain 10000 thrpt 20 536.841 ± 38.696 ops/s
TraverseBench.traverseFilterList 10000 thrpt 20 532.103 ± 30.473 ops/s
TraverseBench.traverseFilterVector 10000 thrpt 20 590.312 ± 29.534 ops/s
TraverseBench.traverseList 10000 thrpt 20 537.023 ± 33.287 ops/s
TraverseBench.traverseListError 10000 thrpt 20 1674.930 ± 48.771 ops/s
TraverseBench.traverseVector 10000 thrpt 20 581.844 ± 32.074 ops/s
TraverseBench.traverseVectorError 10000 thrpt 20 1778.334 ± 135.859 ops/s
TraverseBench.traverse_Chain 10000 thrpt 20 490.822 ± 14.145 ops/s
TraverseBench.traverse_List 10000 thrpt 20 474.568 ± 28.995 ops/s
TraverseBench.traverse_Vector 10000 thrpt 20 622.924 ± 38.002 ops/s

Chain (2d5f4d7)

Benchmark (length) Mode Cnt Score Error Units
TraverseBench.filterList 10000 thrpt 20 3170.974 ± 146.628 ops/s
TraverseBench.filterVector 10000 thrpt 20 3217.881 ± 181.087 ops/s
TraverseBench.mapList 10000 thrpt 20 3443.969 ± 104.117 ops/s
TraverseBench.mapVector 10000 thrpt 20 3827.401 ± 148.074 ops/s
TraverseBench.traverseChain 10000 thrpt 20 784.845 ± 25.990 ops/s
TraverseBench.traverseChainError 10000 thrpt 20 1302.782 ± 62.342 ops/s
TraverseBench.traverseFilterChain 10000 thrpt 20 748.986 ± 62.151 ops/s
TraverseBench.traverseFilterList 10000 thrpt 20 669.817 ± 41.798 ops/s
TraverseBench.traverseFilterVector 10000 thrpt 20 618.644 ± 28.558 ops/s
TraverseBench.traverseList 10000 thrpt 20 583.413 ± 27.079 ops/s
TraverseBench.traverseListError 10000 thrpt 20 1253.510 ± 76.328 ops/s
TraverseBench.traverseVector 10000 thrpt 20 581.975 ± 27.619 ops/s
TraverseBench.traverseVectorError 10000 thrpt 20 1302.362 ± 66.015 ops/s
TraverseBench.traverse_Chain 10000 thrpt 20 3598.033 ± 108.420 ops/s
TraverseBench.traverse_List 10000 thrpt 20 4048.287 ± 158.090 ops/s
TraverseBench.traverse_Vector 10000 thrpt 20 4094.024 ± 101.115 ops/s

Vector (702ab8b)

Benchmark (length) Mode Cnt Score Error Units
TraverseBench.filterList 10000 thrpt 20 3305.299 ± 233.483 ops/s
TraverseBench.filterVector 10000 thrpt 20 3253.531 ± 133.993 ops/s
TraverseBench.mapList 10000 thrpt 20 3308.961 ± 138.697 ops/s
TraverseBench.mapVector 10000 thrpt 20 3911.187 ± 142.582 ops/s
TraverseBench.traverseChain 10000 thrpt 20 594.535 ± 29.777 ops/s
TraverseBench.traverseChainError 10000 thrpt 20 1084.548 ± 75.686 ops/s
TraverseBench.traverseFilterChain 10000 thrpt 20 642.895 ± 55.621 ops/s
TraverseBench.traverseFilterList 10000 thrpt 20 682.965 ± 41.234 ops/s
TraverseBench.traverseFilterVector 10000 thrpt 20 696.235 ± 59.945 ops/s
TraverseBench.traverseList 10000 thrpt 20 582.275 ± 35.877 ops/s
TraverseBench.traverseListError 10000 thrpt 20 1114.959 ± 52.641 ops/s
TraverseBench.traverseVector 10000 thrpt 20 638.235 ± 42.533 ops/s
TraverseBench.traverseVectorError 10000 thrpt 20 1098.038 ± 62.946 ops/s
TraverseBench.traverse_Chain 10000 thrpt 20 3480.086 ± 178.654 ops/s
TraverseBench.traverse_List 10000 thrpt 20 3950.650 ± 171.441 ops/s
TraverseBench.traverse_Vector 10000 thrpt 20 3695.495 ± 182.401 ops/s

@armanbilge (Member) commented Aug 23, 2023

  • The TraverseFilter laws hard-code Option. How do we test that the new branching implementation is lawful?

Can we add some new laws in terms of OptionT[Eval, _]? Edit: I don't think this will work unless we make sure that OptionT offers a StackSafeMonad if F is a StackSafeMonad.
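A rough sketch of what such an instance might look like (an assumption about how it could be done, not actual cats code; assumes the kind-projector plugin for the `*` syntax):

import cats.StackSafeMonad
import cats.data.OptionT

// Hypothetical instance: OptionT preserves stack safety when F is a
// StackSafeMonad, since OptionT's flatMap just delegates to F.flatMap.
def optionTStackSafe[F[_]](implicit F: StackSafeMonad[F]): StackSafeMonad[OptionT[F, *]] =
  new StackSafeMonad[OptionT[F, *]] {
    def pure[A](a: A): OptionT[F, A] = OptionT.pure[F](a)
    def flatMap[A, B](fa: OptionT[F, A])(f: A => OptionT[F, B]): OptionT[F, B] =
      fa.flatMap(f) // stack safety is inherited from F's flatMap
  }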

Should I open a separate PR for the new benchmarks first?

Nah.

@johnynek (Contributor)

Can you post the results of the benchmarks? Apologies if I missed them.

val iter = fa.iterator
if (iter.hasNext) {
val first = iter.next()
G.map(iter.foldLeft(f(first)) { case (g, a) =>
Contributor:

I think this can be G.void(...) so that it has a chance for an optimized implementation.

Member Author:

Yes, good catch!

G.map(fa.iterator.foldLeft(G.pure(builder)) { case (bldrG, a) =>
G.flatMap(bldrG) { bldr =>
G.map(f(a)) {
case Some(b) => bldr += b
Contributor:

I can't see why we know this is safe or lawful.

We have a mutable data structure that could potentially be in a multithreaded situation with G.

Also, consider cases like IO where you have a long computation that finally fails, then you recover some part of it to succeed. It feels like this mutable builder could remember things from failed branches.

I think using an immutable builder (like for instance just building up a List, Chain or Vector) would be much easier to verify it is lawful.

Member:

We have a mutable data structure that could potentially be in a multithreaded situation with G.

It can't be: every append to the builder happens in flatMap so by law it must be sequential, not concurrent/parallel.

Also, consider cases like IO where you have a long computation that finally fails, then you recover some part of it to succeed. It feels like this mutable builder could remember things from failed branches.

Hmm. I'm not entirely sure how this applies to traverse. It doesn't have a notion of "recovery". For sure, each individual step that the traverse runs may have a notion of recovery, but that's just it: it will either succeed or it will fail. But there's no way to "recover" an intermediate structure from the traverse itself.

I think using an immutable builder (like for instance just building up a List, Chain or Vector) would be much easier to verify it is lawful.

Won't disagree here. We could do a benchmark to see how much performance we are leaving on the table with that strategy.

@TimWSpence (Member Author)

Can you post the results of the benchmarks? Apologies if I missed them.

Yes, will do. I had originally asked whether I should add the new benchmarks in a separate PR first, which is why I didn't run them straight away.

@TimWSpence (Member Author)

@armanbilge @johnynek apologies it's taken me so long to get round to this. I've updated with benchmarks for the mutable builder approach. Are you still worried about lawfulness? Shall I try to do it with immutable data structures?

Those traverse_ improvements though 😆

Commit: "Is this necessary? The monad laws should ensure that it's safe to use mutable builders. Nonetheless it will be good to confirm the performance delta for using immutable data structures"
@johnynek (Contributor) commented Nov 8, 2023

Thanks for working on this. What I don't see is how a Monad such as fs2.Stream[IO, A] would interact here. In that case you can have laziness, concurrency, and more than one output per input.

Can you give an explanation as to why we shouldn't worry about mutability in such a case?

@armanbilge (Member)

Can you give an explanation as to why we shouldn't worry about mutability in such a case?

We can't have concurrency: we are sequencing the effects with flatMap, which by law must be sequential.

private[cats] def traverseDirectly[G[_], A, B](
fa: IterableOnce[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[Vector[B]] = {
fa.iterator.foldLeft(G.pure(Vector.empty[B])) { case (accG, a) =>
Member:

I think we can do even better here by using a Vector.newBuilder, we can do even even better by using a dedicated builder for the desired collection (e.g. ArraySeq or List) and we can do even even even better by providing a size hint to the builder.

Reply:

I might have misunderstood something, but it looks like that's how things were done before the commit 8cc742e was introduced, which added the immutable data structures. 🙂

Member:

You're right :) let's just say this PR has been open a while and I lost track of the old stuff 😇
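For illustration, the builder-plus-size-hint idea might look roughly like this (a sketch only, with hypothetical names, and subject to the same mutability questions discussed above):

import cats.StackSafeMonad

// Sketch: thread a mutable builder through flatMap and give it a size hint
// so it can allocate once up front.
def traverseDirectlyBuilder[G[_], A, B](
    fa: Iterable[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[Vector[B]] = {
  val bldr = Vector.newBuilder[B]
  bldr.sizeHint(fa.size) // single allocation when the size is known up front
  G.map(
    fa.iterator.foldLeft(G.pure(())) { (acc, a) =>
      // each append is sequenced through flatMap
      G.flatMap(acc)(_ => G.map(f(a))(b => { bldr += b; () }))
    }
  )(_ => bldr.result())
}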

@johnynek (Contributor) commented Nov 9, 2023

We can't have concurrency: we are sequencing the effects with flatMap, which by law must be sequential.

This is definitely a colloquial understanding of Monads, that they represent a sequential computation, but I don't see the law that allows us to make the claim you want.

For instance, in the current PR we are using immutable items inside the monadic type constructor. I think the code as currently written will work. But if we changed those to be mutable, I don't see a proof that it will work, even though it may.

For instance, consider this type:

import cats.Functor
import cats.syntax.functor._

sealed trait Tree[F[_], A]
case class Item[F[_], A](a: A) extends Tree[F, A]
case class InsideF[F[_], A](treeF: F[Tree[F, A]]) extends Tree[F, A]
case class Branch[F[_], A](left: Tree[F, A], right: Tree[F, A]) extends Tree[F, A]

object Tree {
  def pure[F[_], A](a: A): Tree[F, A] = Item(a)
  def flatMap[F[_]: Functor, A, B](fa: Tree[F, A])(fn: A => Tree[F, B]): Tree[F, B] =
    fa match {
      case Item(a)        => fn(a)
      case InsideF(treeF) => InsideF(treeF.map(flatMap(_)(fn)))
      case Branch(left, right) =>
        Branch(flatMap(left)(fn), flatMap(right)(fn))
    }
}

Now, I haven't checked if this follows the monad laws as we have expressed them, but I think it or something like it could. Note that in this picture we are recursively calling flatMap both inside the F context and outside of it. So, if F is lazy, we return before we have fully recursed.

Couldn't something like this cause a problem?

I think saying "monads are sequential" is a bit too hand-wavey to bolt into the default implementation for every user of cats. I think we need a stronger argument than that.
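To make the laziness point concrete, here is a rough sketch using Eval as the lazy F (illustrative only, building on the Tree definition above):

import cats.Eval

// With a lazy F such as Eval, flatMap on an InsideF node returns immediately:
// the inner tree is only visited when the suspended Eval is later forced.
val t: Tree[Eval, Int] = InsideF(Eval.later(Item[Eval, Int](1)))
val mapped: Tree[Eval, Int] = Tree.flatMap(t)(i => Item[Eval, Int](i + 1))
// `mapped` was constructed without running the Eval, so flatMap returned
// before fully recursing; the "sequential" intuition is not automatic here.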

Commit: "To see if this is more performant than an immutable vector"
if (iter.hasNext) {
val first = iter.next()
G.void(iter.foldLeft(f(first)) { case (g, a) =>
G.flatMap(g) { _ =>
Contributor:

Why not G.productR(g, f(a)) here? Is it to avoid calling f when g may have already failed? I think a comment would be helpful.

Member Author:

There was no good reason 😅 Thanks, I'll change it.

Member:

Is it to avoid calling f when g may have already failed?

Maybe this is something we should consider though?

Edit: ahh, I see your comment on f95b087. Fair enough.

fa: IterableOnce[A]
)(f: A => G[Option[B]])(implicit G: StackSafeMonad[G]): G[Vector[B]] = {
fa.iterator.foldLeft(G.pure(Vector.empty[B])) { case (bldrG, a) =>
G.flatMap(bldrG) { acc =>
Contributor:

Add a comment on why we didn't write:

G.map2(bldrG, f(a)) {
  case (acc, Some(b)) => acc :+ b
  case (acc, _) => acc
}

Member Author:

Thanks, good point 👍

fa: IterableOnce[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[List[B]] = {
G.map(fa.iterator.foldLeft(G.pure(List.empty[B])) { case (accG, a) =>
G.flatMap(accG) { acc =>
Contributor:

Add a comment on why not G.map2(accG, f(a))(_ :: _).

Since we know G is lazy (a StackSafeMonad has to be, I think), I'm not sure of the downside here. One answer would be that we don't have to call f if we have already failed in a short-circuiting monad, but we are still iterating the whole list, so we are doing O(N) work regardless. The extra allocation doesn't seem like a big problem, since we already have to allocate the function passed to flatMap in the current case.

By calling map2 we are at least communicating to G what we are doing, and in principle some monads could optimize this (e.g. a Parser can make a more optimized map2 than flatMap, and it can also be stack safe since it runs lazily only when input is passed to the resulting parser).
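As a concrete sketch of the map2 variant being suggested here (illustrative names, not the PR's code):

import cats.StackSafeMonad

// Accumulate with map2 instead of flatMap: G can see the shape of the
// computation and could, in principle, optimize it (as a Parser might).
def traverseDirectlyViaMap2[G[_], A, B](
    fa: IterableOnce[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[List[B]] =
  G.map(
    fa.iterator.foldLeft(G.pure(List.empty[B])) { (accG, a) =>
      G.map2(accG, f(a))((acc, b) => b :: acc) // prepend; reverse at the end
    }
  )(_.reverse)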

@TimWSpence (Member Author)

@valencik absolute legend ❤️ Couple of points:

  1. I'm dumb and not thinking straight 😅 I was thinking the Map implementation would depend on either Chain or Vector, so I was waiting for the outcome of that debate. I'll fix it now.
  2. map and filter should be unchanged by this, so all of the results apart from traverseError are expected, right? It's certainly not obvious to me why traverseError is worse now...
  3. Do you think it's worth having separate implementations for Vector and Chain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the other.

@TimWSpence (Member Author)

@valencik sorry to bother you. Just tagging you in case you missed my previous comment

@valencik (Member)

2. map and filter should be unchanged by this, so all of the results apart from traverseError are expected, right? It's certainly not obvious to me why traverseError is worse now...

I agree, and similarly, I also do not know why traverseError would be worse now.

3. Do you think it's worth having separate implementations for Vector and Chain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the other.

This almost seems like followup work to me? I think, if we are fine with the traverseError regression, we should merge this work (or fix traverseError and then merge). This already is only one part of #4408, we can have incremental improvements follow it.

@TimWSpence (Member Author)

  2. map and filter should be unchanged by this, so all of the results apart from traverseError are expected, right? It's certainly not obvious to me why traverseError is worse now...

I agree, and similarly, I also do not know why traverseError would be worse now.

  3. Do you think it's worth having separate implementations for Vector and Chain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the other.

This almost seems like followup work to me? I think, if we are fine with the traverseError regression, we should merge this work (or fix traverseError and then merge). This already is only one part of #4408, we can have incremental improvements follow it.

Yeah I'm more than happy to draw a line under this PR and follow up separately - it's been open far too long 😆 I'll move it out of draft now

@TimWSpence marked this pull request as ready for review March 21, 2024 16:49
@TimWSpence (Member Author) commented Mar 22, 2024

Argh @valencik I've just realized the current HEAD uses Vector rather than Chain. From what I saw, Chain seemed to be better overall. Shall I switch it back? I think that was @johnynek's view as well?

@valencik (Member) commented Mar 22, 2024

Argh @valencik I've just realized the current HEAD uses Vector rather than Chain. From what I saw, Chain seemed to be better overall. Shall I switch it back? I think that was @johnynek's view as well?

Just to double check I understand, do you mean the implementation of traverseDirectly?

Yeah, let's switch that to Chain and I can rerun benchmarks.

@valencik (Member)

Currently running benchmarks again on a tweak which uses Chain in traverseDirectly and traverseFilterDirectly
https://github.com/typelevel/cats/compare/803107de2f728f648800513ef5cff242a1172fbf...valencik:cats:more-chain?expand=1

@valencik (Member)

Ok, I ran the benchmarks, I think Chain is better, and put up a PR to your branch/fork @TimWSpence
TimWSpence#2

Feel free to leave it for a day or two.
There's a ton of numbers here.
I wouldn't be against taking another look over things.

I also ran a test benchmark run with a traverseDirectlyVector variant, and I think that shows some mild improvements.
But again, maybe we should save that for a follow up.

@TimWSpence (Member Author)

Ok, I ran the benchmarks, I think Chain is better, and put up a PR to your branch/fork @TimWSpence TimWSpence#2

Feel free to leave it for a day or two. There's a ton of numbers here. I wouldn't be against taking another look over things.

I also ran a test benchmark run with a traverseDirectlyVector variant, and I think that shows some mild improvements. But again, maybe we should save that for a follow up.

Thanks so much again! Presumably that PR is the same as the Chain version that I've had in place multiple times in the history of this PR? 😆

Sorry, how is traverseDirectlyVector different?

Commit: "Use `Chain` in `traverseDirectly` helpers"
@valencik (Member)

Before we get into the weeds, my position remains that we should merge this PR if we are ok with the small regression to traverseError benchmarks.

@djspiewak Pinging you as the creator of these benchmarks, what do you think here? We've improved or held stable on all fronts except the traverseError benchmarks.
Which, as a reminder, look like:

  @Benchmark
  def traverseVectorError(bh: Blackhole) = {
    val result = vectorT.traverse(vector) { i =>
      Eval.later {
        Blackhole.consumeCPU(Work) // simulate a fixed cost per element

        if (i == length * 0.3) {
          throw Failure // short-circuit 30% of the way through the input
        }

        i * 2
      }
    }

    try {
      bh.consume(result.value) // forcing the Eval is what runs the traversal
    } catch {
      case Failure => ()
    }
  }
Visual summary of benchmark results (main vs this PR): [JMH Visualizer screenshot]

My gut says these improvements are worth it and we should merge.


Sorry, how is traverseDirectlyVector different?

It was an experiment where I added additional traverseDirectly and traverseFilterDirectly implementations but using Vector. So there was one for Chain and one for Vector. These then got used in the instances for Vector.
This mildly improved some vector benchmarks but made the traverseVectorError worse. So I'm not sure it's worth digging into in this PR.

The commit: efa7ec0

Visual summary of benchmark results (this PR vs traverseDirectlyVector): [JMH Visualizer screenshot]

@TimWSpence (Member Author) commented Mar 26, 2024

It was an experiment where I added additional traverseDirectly and traverseFilterDirectly implementations but using Vector. So there was one for Chain and one for Vector. These then got used in the instances for Vector.
This mildly improved some vector benchmarks but made the traverseVectorError worse. So I'm not sure it's worth digging into in this PR.

Ah cool, thanks. Yeah I had asked about that before and I think we decided to leave that for potential follow-up

@TimWSpence (Member Author)

@valencik sorry, I've lost the plot again 😅 Is there anything else I need to do to this right now?

@valencik (Member) commented Apr 6, 2024

@valencik sorry, I've lost the plot again 😅 Is there anything else I need to do to this right now?

We need to make a call on whether or not we're ok with the small regression in the traverseError micro benchmarks.
Otherwise, I think we're good.

val (leftL, rightL) = fa.splitAt(leftSize)
runHalf(leftSize, leftL)
.flatMap { left =>
val right = runHalf(rightSize, rightL)
Contributor:

An interesting benchmark would be to do Eval.defer(runHalf(rightSize, rightL)), which would short-circuit for applicatives G that can short-circuit.

I wonder how much that would hurt the non-short-circuiting case... this is not for this PR, just something we may come back to in a later PR.

val rightSize = size - leftSize
runHalf(leftSize, idx)
.flatMap { left =>
val right = runHalf(rightSize, idx + leftSize)
Contributor:

Same comment about the possibility to short-circuit the right with Eval.defer(runHalf(rightSize, idx + leftSize)), but that would add cost to the cases where we do need to evaluate everything, which is probably the most common case.

@johnynek (Contributor) commented Apr 26, 2024

So my read based on: #4498 (comment)

is that for a benchmark on Vector with a stack-safe monad (Eval) that short-circuits 30% of the way through the Vector (by throwing, which is highly suspect to me, because that is an effect Eval isn't designed to deal with), we are currently 12% slower.

I think the benchmark isn't one we should care about; instead maybe OptionT[Eval, ?] would be a short-circuiting StackSafeMonad we could implement (although I don't think OptionT or EitherT preserve stack safety currently).

As to why this benchmark is currently slower: as the benchmark is written, in the new PR we fully iterate over the entire Vector building a chain of map2, which in Eval is just implemented the naive way with flatMap. So we iterate the entire vector for sure. Then we call .value and evaluate the Eval. That process will start running the Eval and throw when it finally gets to the index that throws.

In the old code (the code that builds the tree up), we build a tree by cutting the length (which we get in an O(1) call) in halves recursively. Importantly, this process does not iterate over the Vector; it just splits the index range, basically making a binary tree of indexes to call inside an Eval later. Then, using an extra layer of Eval (to use map2Eval), we start evaluating the function calls left to right. So, in this implementation we only access the first 30% of the indices before we hit the call to f that throws.

So, put that way, I think this means that for Vector, short-circuiting via the tree is actually faster than directly using map2. This is possibly because iterating the vector may have to fetch memory that is not in cache; we have to do those fetches in the new code, but in the old code we only fetch an index (possibly incurring a cache miss) when we are sure we will require the function to be evaluated.

I think perhaps the best call is to simply not do this optimization for traverse for Vector and keep the current code for that particular function. That seems like a quick change, and then I would be happy to merge.
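For future readers, the index-splitting strategy described above looks roughly like this (a simplified sketch specialized to Eval, not the actual cats implementation):

import cats.Eval

// Split the index range in half recursively; Eval.defer suspends the right
// half, so a short-circuiting evaluation never touches those indices.
def traverseTreeish[A, B](fa: Vector[A])(f: A => Eval[B]): Eval[Vector[B]] = {
  def go(start: Int, size: Int): Eval[Vector[B]] =
    if (size == 0) Eval.now(Vector.empty)
    else if (size == 1) f(fa(start)).map(Vector(_))
    else {
      val leftSize = size / 2
      go(start, leftSize).flatMap { left =>
        Eval.defer(go(start + leftSize, size - leftSize)).map(left ++ _)
      }
    }
  go(0, fa.length)
}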

@johnynek (Contributor)

btw: I was the person who implemented the tree style traverse in cats (having noticed it by starting with the TreeList datastructure I added to cats-collections: https://github.com/typelevel/cats-collections/blob/master/core/src/main/scala/cats/collections/TreeList.scala)...

I now wonder if the implementation of two strategies in traverseViaChain is actually helping... I can't see why at the moment... There is a tree portion and then a linear portion; the linear portion is currently set at 128 wide. The only motivation I can see for the linear portion is that it may be cheaper to build a List than a Chain, but I don't recall if I benchmarked that.

For any future readers, benchmarking that code with a different width setting (say 32, 64, 128 (current), and 256) might be interesting, along with a version that is a tree all the way down (similar to the implementation of traverse_ in Vector).

If it is very close, it would be worth simplifying the code to only use the binary index tree algorithm for sure.

@TimWSpence (Member Author)

So my read based on: #4498 (comment)

is that for a benchmark on Vector with a stack-safe monad (Eval) that short-circuits 30% of the way through the Vector (by throwing, which is highly suspect to me, because that is an effect Eval isn't designed to deal with), we are currently 12% slower.

I think the benchmark isn't one we should care about; instead maybe OptionT[Eval, ?] would be a short-circuiting StackSafeMonad we could implement (although I don't think OptionT or EitherT preserve stack safety currently).

As to why this benchmark is currently slower: as the benchmark is written, in the new PR we fully iterate over the entire Vector building a chain of map2, which in Eval is just implemented the naive way with flatMap. So we iterate the entire vector for sure. Then we call .value and evaluate the Eval. That process will start running the Eval and throw when it finally gets to the index that throws.

In the old code (the code that builds the tree up), we build a tree by cutting the length (which we get in an O(1) call) in halves recursively. Importantly, this process does not iterate over the Vector; it just splits the index range, basically making a binary tree of indexes to call inside an Eval later. Then, using an extra layer of Eval (to use map2Eval), we start evaluating the function calls left to right. So, in this implementation we only access the first 30% of the indices before we hit the call to f that throws.

So, put that way, I think this means that for Vector, short-circuiting via the tree is actually faster than directly using map2. This is possibly because iterating the vector may have to fetch memory that is not in cache; we have to do those fetches in the new code, but in the old code we only fetch an index (possibly incurring a cache miss) when we are sure we will require the function to be evaluated.

I think perhaps the best call is to simply not do this optimization for traverse for Vector and keep the current code for that particular function. That seems like a quick change, and then I would be happy to merge.

I can absolutely do that. Just for my own benefit, I'm curious to understand the preference for keeping the performance in the exception-throwing case while sacrificing the improvement in the happy-path case?

@johnynek (Contributor) commented May 1, 2024

Actually, sorry: I misread the plots. I was looking at "this PR vs traverseDirectlyVector", I think (the second set of metrics), but the relevant set is "this PR vs main", which was the first set of metrics.

Okay, yeah, it seems there are a lot of wins. The success case is about as much faster as the error case is slower, and since success should be more common than failure, it seems like we should merge.

We can merge as is. IMO.

Shall I click merge? Do we need to wait for anyone else?

@TimWSpence (Member Author) commented May 1, 2024

Actually, sorry: I misread the plots. I was looking at "this PR vs traverseDirectlyVector", I think (the second set of metrics), but the relevant set is "this PR vs main", which was the first set of metrics.

Okay, yeah, it seems there are a lot of wins. The success case is about as much faster as the error case is slower, and since success should be more common than failure, it seems like we should merge.

We can merge as is. IMO.

Shall I click merge? Do we need to wait for anyone else?

Thank you! I'm happy to merge 😆 I think @valencik was as well? Also sorry that there are so many versions of the benchmarks - it's gotten quite confusing

@valencik (Member) commented May 1, 2024

Thank you! I'm happy to merge 😆 I think @valencik was as well? Also sorry that there are so many versions of the benchmarks - it's gotten quite confusing

LET'S DO IT!! 🎉 🚀 🐱

@johnynek merged commit ffb1df6 into typelevel:main May 1, 2024 (16 checks passed)
@TimWSpence deleted the optimize-traverse branch May 2, 2024 08:12